[57] ABSTRACT

Voice activation of functions on a network such as the
Internet is accomplished using a speech recognition system
running synchronously with standard desktop-based Internet 
functions. This synchronous operation allows voice-based 
control to be exercised for all operations on the Internet. 
System functions are based on a unique combination of a 
local web browser, a remotely-located speech/web server, 
and control links between a web browser and a speech/web 
server. The control links provide a mechanism for control- 
ling a speech server from a web page and a mechanism for 
driving both the local, as well as a remote, web browser. 

19 Claims, 7 Drawing Sheets 
USING SPEECH RECOGNITION TO ACCESS
THE INTERNET, INCLUDING ACCESS VIA A 
TELEPHONE 

TITLE OF THE INVENTION 

Method and System for Using Speech Recognition to 
Access the Internet, including Access Via a Telephone. 

FIELD OF THE INVENTION 

The present invention relates to the field of computerized 
communication on the Internet. The general purpose of this 
invention is to enable speech access to the Internet over 
standard telephone lines and Internet control of telephony 
functions through standard web pages. This is accomplished 
through a unique combination of speech server, web 
browser, and control links. The control links provide a 
mechanism for controlling the speech server from a web 
page and a mechanism for driving both the local, as well as 
a remote, web browser. 

BACKGROUND OF THE INVENTION 

The Internet is essentially a network of servers containing 
information that users can obtain using personal computers. 
Users generally connect to a server, a computer equipped 
with information and capabilities that assist the user with 
contacting other servers and obtaining additional informa- 
tion. Users typically execute these functions, also referred to 
as "navigating" on the Internet, using a mouse and 
Windows-based software. The user's navigation of the Inter- 
net is thus essentially graphically-based (looking at a screen) 
with functions activated using a mouse. 

Speech recognition software and hardware for use in 
conjunction with personal computers and other 
environments, like the Internet, is a rapidly developing 
technology. With speech recognition, a user's voice com- 
mands are recognized by a computer and then converted, 
based on the speech pattern, into an electronic signal. For 
example, speech recognition has been highly successful in 
the field of long-distance telephone calling for the purpose 
of allowing collect calls. Typically, with this application, a 
caller will provide a name and a phone number to a 
computer when making a collect call. The computer will 
then place the caller on hold and call the number to be 
reached. The person receiving the collect call will answer 
"yes" or "no" in response to the computer message and the 
collect caller's name. The voice recognition hardware and 
software, which is also known as a speech recognition 
engine, either signals a switch to complete the call upon 
recognizing the "yes" response, or to disconnect upon rec- 
ognizing the "no" response. 

One issue with using speech recognition is selecting the 
appropriate speech recognition engine to use for a particular 
application. These speech recognition engines include 
speaker dependent and independent dictation machines, 
continuous speech systems, large vocabulary systems, and 
small vocabulary systems. Further, these systems can be 
Windows based, Macintosh based, UNIX based, Windows 
NT based, or based on another platform, depending on the 
preferred operating system. 

Speech recognition operating in conjunction with com- 
puter connection with the Internet, also known as speech 
enabling of the Internet, appears to have promising appli- 
cation possibilities. One possible application of this tech- 
nology is for navigational purposes on the Internet. For 
example, speech recognition has been successfully utilized 
at the desktop level generally. Voice macros have been
created for a number of Windows functions for use on the
Internet. A macro is a series of functions on the computer
activated by a single command. For a voice macro, the
speech server's recognition of an inputted voice command
activates a series of commands.
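
A voice macro of this kind can be pictured as a lookup from one recognized phrase to an ordered series of functions. The Python sketch below is illustrative only and is not taken from any system named in this document: the phrases, the action functions, and the run_voice_macro helper are hypothetical, and speech recognition is assumed to have already produced a text string.

```python
from typing import Callable, Dict, List

# Hypothetical desktop actions; each one only records what it would do.
def open_browser(log: List[str]) -> None:
    log.append("open browser")

def go_to_bookmarks(log: List[str]) -> None:
    log.append("open bookmarks")

def select_first_bookmark(log: List[str]) -> None:
    log.append("select first bookmark")

# One spoken command is bound to a series of functions (the macro).
VOICE_MACROS: Dict[str, List[Callable[[List[str]], None]]] = {
    "check my favorites": [open_browser, go_to_bookmarks, select_first_bookmark],
    "start browsing": [open_browser],
}

def run_voice_macro(recognized: str, log: List[str]) -> bool:
    """Run every action bound to the recognized phrase, in order.

    Returns False when the phrase is not a known macro.
    """
    actions = VOICE_MACROS.get(recognized.strip().lower())
    if actions is None:
        return False
    for action in actions:
        action(log)
    return True
```

Calling run_voice_macro("check my favorites", log) runs the three bound actions in order, while an unrecognized phrase leaves the log untouched and returns False.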

Two prior art methods for speech-enabling the Internet
have been explored by various companies and research
entities. In general terms, researchers have approached the
problem from either the perspective of speech-enabling the
Internet, or from the perspective of Internet-enabling the
telephone system.

The first method is the most common approach and the
one being pursued by Texas Instruments, Apple Computer,
and Microsoft. In this approach, the speech recognition
engine is located on the local host, along with the web
browser. This approach allows such activities as those
described above: voice macros for Windows functions that
can be used when browsing the Internet.

Texas Instruments further refined this approach by using
the text associated with hotlinks to supply the vocabularies
for the recognizer. Apple has taken the approach of making
both the web browser and the speech recognition engine
scriptable (controllable with the AppleScript language).
Microsoft has taken the approach of providing tools for web
page developers to allow them to speech-enable their web
pages. These tools provide a mechanism for supplying the
recognizer with grammars and their speech synthesizers
with spoken prompts.
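
The idea of using hotlink text as the recognizer's vocabulary can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the method of any company named above: the HotlinkCollector class and hotlink_vocabulary function are hypothetical names, only the standard-library html.parser module is used, and the "vocabulary" is simply a mapping from the lower-cased link text to the link target.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional

class HotlinkCollector(HTMLParser):
    """Collect the visible text of each <a href=...> element on a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: Dict[str, str] = {}     # spoken phrase -> target URL
        self._href: Optional[str] = None    # href of the link being read
        self._text: List[str] = []          # text fragments inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only text that appears inside an open hotlink is kept.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse whitespace and lower-case so matching is uniform.
            phrase = " ".join("".join(self._text).split()).lower()
            if phrase:
                self.links[phrase] = self._href
            self._href = None

def hotlink_vocabulary(html: str) -> Dict[str, str]:
    """Return a phrase-to-URL mapping built from the page's hotlinks."""
    collector = HotlinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

The keys of the returned mapping are the phrases a recognizer would need to listen for on that page; a real grammar compiler would take them as input in whatever format it requires.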

The advantages of the present invention over this method
include: (1) telephone access serves a far greater potential
audience than speech access limited to desktop operations;
(2) nothing additional, such as a speech recognition engine,
is required of the user's computer; (3) the system
uses a migration path starting with an immediate utility with
no long-term limitations; and (4) direct benefits are available
from telephony integration.

Internet-enabling the telephone system is primarily being
investigated as a research effort. Demonstrations from MIT
and the Sun SpeechActs group have shown potential for
using a speech-only interface for retrieving personal
information (voice e-mail) over the phone and for using the
Internet as an up-to-date repository of information available
over the phone. For example, ALTech, a commercial spin-off
of MIT, has demonstrated the use of a speech server for
obtaining information about local movies.

Advantages of the present invention over this method
include: (1) an optional Graphical User Interface (GUI)
makes using the system with today's World Wide Web much
more practical and simple than attempting to do it with
speech alone; (2) the potential user base is just as large over
the long term; and (3) providing tools to other developers is
expected to lead to much more rapid progress than
attempting to build speech-only interfaces from the ground up.

SUMMARY OF THE INVENTION 

This invention links networks such as the Internet and the
World Wide Web to a speech recognition server, which
resides on the telephone system, to provide for speech access
to these networks over standard telephone lines and control
of telephony functions through standard web pages. These
capabilities are accomplished through a combination of
speech server (typical of those found in Interactive Voice
Response (IVR) applications), web browser, and control
links. The control links consist of software that provides a 
mechanism for controlling the speech server from a web 
page, and a mechanism for driving both the local, as well as
a remote, web browser.

An example of the capabilities of the system is as follows.
A user seeking a service to provide stock quotes can access
these quotes by graphically browsing the Internet to a web
page that continually carries the quotes. Once at the web
page the user can activate the present invention telling the
speech server to, for example "mark this or show me the
stock quote". The server can then be set to either tell the user
the stock price or go to that web page upon recognizing
the selected speech pattern.

The general purpose of this invention is thus to provide a
method for linking a remote speech recognition device
operating over the telephone network to any web browser
operating over the Internet. This link enables the user's web
browser to be controlled by the remote speech recognition
device, and, in turn, enables telephony functions to be
controlled by any web browser. In addition to providing an
immediate solution to accessing the web by voice, the
invention provides tools and motivation for web page
authors to generate web pages that are tailored to
speech-only interfaces. This is expected to transform the nature of
the web, and, over time, to support a truly multi-modal
interface with the Internet.

The significance of the invention is that it provides both
a means for immediately speech-enabling the Internet and a
means for gradually Internet-enabling the telephone system.
Other systems have approached the problem of linking
speech technology and the Internet from either one
perspective or the other (that is, speech-enabling the net or
net-enabling the telephone). The approach of the present
invention, however, can be viewed from either perspective
and, in so doing, leads to an immediate speech-enabling of
the Internet, and to a process of Internet-enabling the
telephone. In addition, the present invention leads to
functionality completely unobtainable from either of the other
approaches taken alone.

The control of both the server's web browser and the
user's remote web browser also enables an optional GUI for
the user of the Speech/web server. The GUI link is not
required for the system to operate; however, because the web
is currently graphically-oriented, the ability to use the local
web browser as a GUI for the speech-driven browser is
expected to be beneficial when surfing the web by voice. The
concept of a telephony-based web browser with an optional
GUI constitutes a significant attribute of the system because
it provides a common platform that can be used for simple
applications by anyone with a telephone. In addition, it can
be used for more difficult tasks when a PC or workstation is
available to the user.

Another example of the use of the present invention
pertains to speech input and output over telephone lines as
the additional modality that can be linked to the
conventional web browser interface. Thus, rather than placing a
call, hanging up, and placing another call, a user will be able
to browse using the telephone. This browsing includes such
activities as seamlessly speaking to one person, and then
connecting to another, and then checking messages and
ordering a pizza, all without hanging up and without ever
dialing a number. The same method links any alternative
user interface to the user's standard web browser. This thus
pertains to browsers with teletypewriter (TTY) interfaces,
browsers that understand and speak other languages, or even
browsers capable of providing a sense of smell, sight, taste,
and touch.

Additional objects, advantages and novel features of the
invention will be set forth in part in the description which
follows, and in part will become more apparent to those
skilled in the art upon examination of the following or may
be learned by practice of the invention. The objects and
advantages of the invention may be realized and attained by
means of the instrumentalities and combinations particularly
pointed out in the appended claims.

To achieve the stated and other objects of the present
invention, as embodied and described below, the invention
may comprise the steps of:

accessing a voice recognition server through a voice
transmission device;

such device translating voice transmissions into electronic
signals; and

using said translated voice transmissions to perform
functions on the Internet via voice translation being
performed by said server.

BRIEF DESCRIPTION OF THE DRAWINGS

A block diagram of the invention is shown in FIG. 1.

FIG. 2 shows how a user that happens across the web page
containing connection information on the present invention
initiates the process of speech enabling his or her web
browser using the preferred embodiment.

FIG. 3 illustrates the exchange of information necessary
to speech enable a web browser.

FIG. 4 shows the connections in place for operation of the
preferred embodiment.

FIG. 5 illustrates all of the components of the system in
operation.

FIG. 6 contains an alternative embodiment, in which the
local web browser is a slave to the speech/web server.

FIG. 7 contains a second alternative embodiment, in
which the speech/web server is a slave to the local web
browser.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENT

Using the drawings, the preferred embodiment of the
present invention will now be explained.

A block diagram of the invention is shown in FIG. 1. A
local web browser 1, such as Netscape on a PC, is used to
browse the Internet 2 using a conventional Transport Control
Protocol (TCP) link 3. The local web browser 1 contains an
Applied Speech Technology Protocol (ASTP) plugin 4,
which communicates by ASTP link 5 with an ASTP
controller 6 located within a speech/web browser 7 of a
speech/web server 8, such as a Pentium processor-based PC running
Windows NT. This PC also hosts, or a separate PC coupled
to the speech/web server 8 hosts, the speech server 9, which
is coupled 10 to the ASTP controller 6. These couples can
consist of such connections as an electronic circuit, a fiber
optic line, an electromagnetic signal, or any other means of
coupling known in the art. A Dialogic line card located in the
backplane of the speech server 9 PC couples 11 the speech
server to a telephone network 12. The speech/web browser
7 is also TCP linked 13 to the Internet 2.

The three major components of the speech/web server
thus are the speech/web browser 7, the speech server 9 with
telephony functions, and the ASTP controller software 4 and
6.

The speech/web browser 7 is a standard, off-the-shelf web
browser with an ASTP plug-in 6. The ASTP plug-in 6, as
described below, is a software program written in a
language, such as JAVA, that allows the program to run
within a web browser, such as Netscape. However, since the
speech/web browser 7 is driven by speech-only, it is always
run in text-only mode. This gives it a considerable response
time advantage over a browser that must download and
display graphics. The time normally devoted to graphics can
thus be used by the recognizer (speech server 4) to compile
the grammar for the new web page.

The speech server 9 is typical of those used for IVR and
operator assist applications. These systems vary
considerably in the number of simultaneous channels of speech
recognition they can support, but are most often built from
off-the-shelf components that plug into a PC (AT bus). A
typical configuration for a speech server would be a Pentium
class PC running UNIX or Windows NT, loaded with a
speech recognizer such as ALTech, PureSpeech, or Nuance,
with a Dialogic line card capable of handling multiple
simultaneous telephone lines, and two speech recognition
boards, each with four channels of recognition. Speech
output is either from pre-recorded prompts or a speech
synthesizer. The telephone line card enables the system to
dial out, receive calls, and to conference calls.

The ASTP software 4 and 6 is the heart of the system. As
noted, this software is written and distributed as a plug-in
module to Netscape or other browsers and is written in a
typical software that can operate in Netscape, such as JAVA.
The protocol is a superset of the Common Client Interface
(CCI), which provides the mechanism for establishing a
persistent link between the speech/web browser 7 and the
user's browser (local web browser 1). The persistent link
enables the speech/web browser 7 to remotely control the
user's web browser 1, the user's web browser 1 to control
the speech/web browser 7, and also allows the two browsers
1 and 7 to traverse the web in tandem.

In addition to the CCI-like capability, the ASTP protocols
provide the interface to the speech server 9, telling the
recognizer what grammar to compile for the next web page.
This function is typically fulfilled by simply stripping the
text associated with each hotlink and sending it to the
recognizer's grammar compiler. Alternatively, versions of
the protocol support calls to high-level routines, called
"speech behaviors", that handle all of the dialog between the
user and the machine. These high-level routines allow users
to supply, by voice, specific kinds of information when using
the Internet, such as credit card numbers, addresses, and
telephone numbers. By providing web page authors with
access to well-designed dialog modules that can be easily
deployed through simple-to-use web authoring tools, such as
the ASTP protocols, the predominately graphical nature of
the web changes to accommodate a speech-only,
telephone-based interface.

Finally, the ASTP link 5 is what provides the conduit
between the web page and the telephone. This allows web
authors to include telephone numbers associated with
hotlinks that can be dialed by the speech/web server 8. This
capability may change how switching is currently done in
the telephone network 12.

FIG. 2 shows how a user that happens across the web page
containing connection information on the present invention
initiates the process of speech enabling his or her web
browser using the preferred embodiment. A user 15 using a
local web browser 16 initiates a TCP connection 17 with the
speech/web site, which is served by the speech/web server
18, by selecting a hotlink such as "surf the web by voice" at
the web site.

In FIG. 3, user 15 of a local web browser 16 and local
telephone 19 uploads 20 to the speech/web server 18 from
the local web browser 16 the local telephone number and
downloads 21 the ASTP plug-in from the speech/web server
18. In FIG. 4, the user 15 of the local web browser 16 and
local telephone 19 simultaneously connects by ASTP
connection 17 and by telephone connection 22 with the
speech/web server 18.

The setup of the preferred embodiment is now completed,
as shown in FIG. 5. The user 15 of the local web browser 16
and local telephone 19 simultaneously communicates with
the speech/web server 18 via ASTP connection 17 and
telephone connection 22. The user 15 is also connected by
a TCP link 25 to other web servers 24 simultaneously 26
with the speech/web server 18 connection by a TCP link 23
to those other web servers 24.

As a result of these simultaneous links 26, the user can
browse the Internet using voice while looking at the screen
of the local web browser 16 and speaking over the phone 19.
Typically these links allow a user to speak into the phone
using words within the system's capability. These words are
recognized and interpreted by the speech/web browser
located at the speech/web server 18 and translated into a
TCP link 23 command for the speech/web browser at the
speech/web server 18. At the same time, the ASTP supplies
the same TCP link command 17 on the local web browser
15. Thus, the user 15 speaks to control browsing of the
Internet.

A significant advantage of the preferred embodiment is
responsiveness. The dual link approach allows time for the
speech/web server to generate grammars while the user's
browser is busy displaying graphics. A secondary advantage
is that neither of the web browsers need to be modified for
the system to work.

Variation and Modifications

Two variations on the invention are illustrated in FIGS. 6
and 7. These approaches differ from the one described in
[illegible] in that they require only a single link into the Internet,
rather than the two links described previously.

In the method shown in FIG. 6, the local web browser 1
with ASTP plug-in 4 is linked 5 to an ASTP controller 6
located within a speech/web browser 7 housed within a
Pentium processor PC-based speech/web server 8. This PC
is typically running Windows. This PC also hosts, or a
separate PC coupled to the speech/web server 8 hosts, the
speech server 9, which is coupled 10 to the ASTP controller
6. The speech server 9 is linked 11 to a telephone network
12. The speech/web browser 7 is also TCP linked 13 to the
Internet 2.

The primary difference between this alternative and the
earlier embodiment (FIG. 1) is that a direct link 13 does not
exist between the speech/web browser 6 and the Internet 2
simultaneous with a link between the local web browser 1
and the Internet 2 (link 3 of FIG. 1).

In the method shown in FIG. 7, the local web browser 1
with ASTP plug-in 4 is linked 5 to an ASTP controller 6
located within a speech/web browser 7 housed within a
Pentium processor-based PC speech/web server 8. This PC
also hosts, or a separate PC coupled to the speech/web server
8 hosts, the speech server 9, which is coupled 10 to the ASTP
controller 6. The speech server 9 is linked 11 to a telephone
network 12. The local web browser 1 is also TCP linked 3
to the Internet 2.

The primary difference between this alternative and the
earlier embodiment (FIG. 1) is that a direct link does not
exist between the speech/web browser 6 and the Internet 2
(link 13 of FIG. 1) simultaneous with a link 3 between the
local web browser 1 and the Internet 2.
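
The tandem browsing of the preferred embodiment, in which one recognized phrase drives both the speech/web browser and the local web browser, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the ASTP software itself: the Browser and TandemController classes are hypothetical, the browsers are stand-ins that only record the URLs they are told to load, and no ASTP or CCI message format is modeled because none is given here.

```python
from typing import Dict, List, Optional

class Browser:
    """Stand-in for a web browser that can be told to load a URL."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.history: List[str] = []

    def navigate(self, url: str) -> None:
        self.history.append(url)

class TandemController:
    """Hypothetical controller that keeps two browsers on the same page."""

    def __init__(self, remote: Browser, local: Optional[Browser] = None) -> None:
        self.remote = remote               # speech-driven browser on the server
        self.local = local                 # optional GUI browser at the user's PC
        self.grammar: Dict[str, str] = {}  # spoken phrase -> URL for current page

    def load_grammar(self, phrase_to_url: Dict[str, str]) -> None:
        """Replace the active grammar, for example with a page's hotlink text."""
        self.grammar = {p.lower(): u for p, u in phrase_to_url.items()}

    def on_recognized(self, phrase: str) -> bool:
        """Send the same navigation command to both browsers.

        Returns False when the phrase is outside the current grammar.
        """
        url = self.grammar.get(phrase.strip().lower())
        if url is None:
            return False
        self.remote.navigate(url)
        if self.local is not None:
            self.local.navigate(url)
        return True
```

Because the local browser is optional in this sketch, the same controller also covers the speech-only case in which no GUI link is present.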

What is claimed is: 

1. A remote server to enable a local user to increase the 
functionality of a local browser having a graphical user 
interface, comprising: 

a remote web browser residing on the remote server;

a speech controller electronically coupled to said remote 
web browser, said controller being configured to form 
control links coupling the local browser to said remote 
browser via an Internet data communication link to 
enable said remote web browser and the local browser
to function cooperatively; and 

a speech server having a speech recognition function 
residing on the remote server, said speech server coupling
said controller to a telephone network so that a
telephonic voice communication link may be estab- 
lished between the user and said controller; 

wherein voice commands to control browsing may be 
input via said telephonic voice communication link and 
wherein graphical user interface commands to control
browsing may also be input via the local browser. 

2. The server of claim 1, wherein said controller and said 
server are configured to form said telephonic voice commu- 
nication link in response to the user accessing a web site via 
said Internet data communication link.

3. The server of claim 1 wherein said control links are 
configured to enable the local browser to control the
telephonic function of said speech server.

4. The server of claim 1, wherein said controller is a 
software module contained in said remote browser.

5. The server of claim 4, wherein said controller is 
configured to download a software program to the local 
browser to form persistent control links. 

6. A remote server to enable a local user to increase the 
functionality of a local browser, comprising:

a remote web browser residing on the remote server; 

a speech controller electronically coupled to said remote 
web browser, said controller being configured to form 
control links coupling the local browser to said remote 
browser via an Internet data communication link to
enable said remote web browser and the local browser 
to function cooperatively; and 

a speech server having a speech recognition function 
residing on the remote server, said speech server coupling
said controller to a telephone network so that a
voice communication link may be established between 
the user and said controller; 

wherein said control links are configured to enable voice 
commands to be uploaded to control the browsing
function while information from the Internet is down- 
loaded to a graphical user interface of the local browser. 

7. The server of claim 6, wherein said control links are 
configured so that the user may browse by both voice 
commands and by inputting commands via said graphical
user interface. 

8. A network system, comprising: 

a) a local browser disposed on a local computer; and 

b) a remote server including: 

i) a remote browser residing on said remote server;

ii) a speech controller software module electronically 
coupled to said remote browser; and 




iii) a speech server having a speech recognition func- 
tion residing on the remote server, said speech server 
coupling said speech controller software module to a 
telephone network so that a voice communication 
link may be established between the user and said 
speech controller software module; 
said controller software module having an interface pro- 
tocol for remotely controlling web browsers configured 
to form control links coupling said local browser to said 
remote browser via a network data link to enable said 
remote web browser and said local browser to function 
cooperatively, wherein said control links are configured 
so that auxiliary voice commands may be input by the 
user to control browsing of the network. 

9. The system of claim 8, wherein said local browser 
includes a graphical user interface and said control links are 
configured so that the user may browse by both voice 
commands and by inputting commands via said graphical 
user interface. 

10. A network system, comprising: 

a) a local browser disposed on a local computer; and 

b) a remote server including: 

i) a remote browser residing on said remote server; 

ii) a speech controller software module electronically 
coupled to said remote browser, said controller soft- 
ware module being configured to form control links 
coupling said local browser to said remote browser 
via a network data link to enable said remote web 
browser and said local browser to function coopera- 
tively; and 

iii) a speech server having a speech recognition func- 
tion residing on the remote server, said speech server 
coupling said speech controller software module to a 
telephone network so that a voice communication 
link may be established between the user and said 
speech controller software module; 

wherein said control links are configured to enable voice 
commands to be uploaded to control the browsing 
function while information from the network is down- 
loaded to the graphical user interface of said local 
browser. 

11. The system of claim 10 wherein said controller 
software module includes an interface protocol for remotely 
controlling a web browser. 

12. A method for permitting a local user to link a local 
web browser to a remote speech recognition device, com- 
prising the steps of: 

a) electronically coupling the local browser to a web-site 
served by a remote server; 

b) downloading a software program from a remote web 
browser residing on said remote server to form control 
links between the local web browser and a controller 
coupled to said remote web browser; and 

c) telephoning the user to form a voice communication 
link between the user and said controller via a speech 
server coupling said controller to a telephone network; 

whereby the user may input voice commands which are 
translated by said speech server to control browsing of 
a computer network while information from the net- 
work is downloaded to a graphical user interface of the 
local browser. 

13. The method of claim 12, further comprising after step 
"b" the step of: uploading the phone number of the local 
user. 

14. The method of claim 12, wherein said controller 
software module is contained in said remote web browser. 



05/22/2003, EAST Version: 1.03.0007 



6,101,473 






15. A method for permitting a local user to use voice 
commands to perform functions on a network, comprising 
the steps of: 

a) providing a remote server, the remote server having a 
controller for forming a first data communication link 
with a local user and a speech server for converting 
voice commands into control signals; 

b) accessing said remote server to form a first electronic 
communication link to a local browser; 

c) telephoning the user to form a voice transmission 
communication link coupling the user to the controller 
via said speech server; 

d) translating voice commands into electronic data signals 
using said speech server; and 

e) using said translated voice commands to perform 
functions on the network; 

wherein said controller is configured to enable voice 
commands to be uploaded to control the browsing 
function while information from the network is down- 
loaded to a graphical user interface of the local browser. 



16. The method of claim 15, wherein the network com- 
prises the Internet and further wherein the controller is 
contained in a remote browser residing on said remote 
server. 

17. The method of claim 16, wherein said speech server
is coupled to a telephone network and further comprising 
after step "b" the step of: 
uploading a local telephone number. 

18. The method of claim 15, further comprising the step of:

downloading a software program to said local browser to 
enable a persistent link to be formed between the 
controller and the local browser. 

19. The method of claim 15, wherein said speech server
is coupled to a telephone network and further comprising the
step of:

accessing a hot-linked phone number on a web-site to 
initiate dialing of said phone number by said speech 
server. 
